1 Large Language Models (LLMs)

Table 1.1: LLMs
|   | Provider | Model | Version | Estimate | Rank |
|---|---|---|---|---|---|
| 1 | anthropic | Claude 3.7 Sonnet | claude-3-7-sonnet-20250219 | 3.8580848 | top |
| 2 | anthropic | Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 3.4210271 | top |
| 3 | xai | Grok 3 Beta | grok-3-beta | 3.0488472 | top |
| 4 | anthropic | Claude 3 Haiku | claude-3-haiku-20240307 | 0.3764656 | bottom |
| 5 | cohere | Command R | command-r-08-2024 | 0.3764656 | bottom |
| 6 | openai | GPT-3.5 Turbo | gpt-3.5-turbo | 0.3299676 | bottom |
| 7 | openai | GPT-4o Mini | gpt-4o-mini | 0.2865677 | bottom |
| 8 | google | Gemini 2.5 Flash | gemini-2.5-flash | NA | new |

Building on our previous analysis, we selected models based on their performance. We chose 4 top models (see note 1), whose deliberative reasoning was reliably more consistent than chance, and 4 bottom models, whose deliberative reasoning was reliably less consistent than chance.

2 Cases

Table 2.1: Cases
|   | Case | Survey | N Participants |
|---|---|---|---|
| 1 | CCPS ACT Deliberative | ccps | 31 |
| 2 | CSIRO WA | energy_futures | 17 |
| 3 | Winterthur | zh_winterthur | 16 |

3 Surveys

Table 3.1: Surveys
|   | survey | considerations | policies | scale_max | q_method |
|---|---|---|---|---|---|
| 1 | ccps | 33 | 7 | 11 | FALSE |
| 2 | energy_futures | 45 | 9 | 11 | FALSE |
| 3 | zh_winterthur | 30 | 6 | 7 | FALSE |

4 Roles

Table 4.1: Roles
|   | uid | type | article | role | description |
|---|---|---|---|---|---|
| 1 | eco | ideology | an | ecologist | focuses on environmental protection and sustainability, advocating for societal change to ecological limits |
| 2 | coa | perspective | a | coastal resident | endures chronic flooding and salinization, forced to relocate due to rising sea levels and intense storms worsened by climate change |
| 3 | ctr | perspective | a | construction worker | suffers from extreme heat stress and lost work hours, perceiving climate change making outdoor labor unbearable and life-threatening |
| 4 | dis | perspective | a | disease survivor | recovers from dengue fever, aware that climate change’s rising temperatures are expanding the range of disease-carrying mosquitoes in their region |
| 5 | eld | perspective | an | elderly urban resident | endures intensified city heatwaves, struggling with disrupted services and feeling the direct, severe impact of climate change |
| 6 | far | perspective | a | displaced family | loses their home due to unprecedented wildfires, experiencing displacement and recognizing climate change as the major driver of the devastation |
| 7 | fis | perspective | a | fisher | notes his declining catches due to warming oceans, understanding that climate change is reorganizing marine life and reducing their traditional yield |
| 8 | lan | perspective | a | landowner | surveys his parched fields after a prolonged drought, feeling the compounding impacts of climate change that reduce crop yields and family income |
| 9 | par | perspective | a | parent | sees their child fall ill from a water-borne disease, attributing its spread to the increased heavy rainfall and warmer temperatures brought by climate change |
| 10 | sub | perspective | a | subsistence farmer | watches his crops wither under erratic rainfall patterns, and who sees these changes as direct consequence of climate change |
| 11 | vil | perspective | a | villager | faces dwindling, contaminated water supplies due to extended draughts and floods, aware that climate change is altering their water security |
| 12 | csk | devils | a | climate skeptic | prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science |

5 Methods

5.1 Data collection

We collected 1440 responses generated by 8 models across the 3 surveys and 12 roles described above. We prompted each LLM 5 times with the same prompt (8 models × 3 surveys × 12 roles × 5 repetitions = 1440 responses).
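The response count can be checked by enumerating the design. The sketch below is only an illustration in Python, not the collection script used for this report; the identifiers are copied from Tables 1.1, 3.1 and 4.1, and the function name `enumerate_requests` is hypothetical.

```python
from itertools import product

# Identifiers copied from Tables 1.1, 3.1 and 4.1 of this report.
MODELS = [
    "claude-3-7-sonnet-20250219",
    "claude-3-5-sonnet-20241022",
    "grok-3-beta",
    "claude-3-haiku-20240307",
    "command-r-08-2024",
    "gpt-3.5-turbo",
    "gpt-4o-mini",
    "gemini-2.5-flash",
]
SURVEYS = ["ccps", "energy_futures", "zh_winterthur"]
ROLES = ["eco", "coa", "ctr", "dis", "eld", "far",
         "fis", "lan", "par", "sub", "vil", "csk"]
REPETITIONS = 5  # each prompt is sent 5 times


def enumerate_requests():
    """Return one (model, survey, role, repetition) tuple per LLM request."""
    return [
        (model, survey, role, rep)
        for model, survey, role in product(MODELS, SURVEYS, ROLES)
        for rep in range(1, REPETITIONS + 1)
    ]


requests = enumerate_requests()
cells = {(m, s, r) for m, s, r, _ in requests}
print(len(requests))  # 8 * 3 * 12 * 5 = 1440 responses
print(len(cells))     # 8 * 3 * 12 = 288 model/survey/role cells
```

Each of the 288 model/survey/role cells holds 5 responses.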

5.1.1 System prompt (Roles)

We instructed LLMs to play each of the roles described above by including a system instruction in each request following the pattern:

Answer the following prompts as [article] [role], who [description].

For example:

Answer the following prompts as a climate skeptic, who prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science.
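As a minimal sketch of how such a system instruction could be assembled from the article, role and description columns of Table 4.1 (the function name `build_system_prompt` is hypothetical and not taken from the project code):

```python
# Template taken from the pattern stated above.
SYSTEM_TEMPLATE = "Answer the following prompts as {article} {role}, who {description}."


def build_system_prompt(article: str, role: str, description: str) -> str:
    """Fill the pattern with one row of the roles table."""
    return SYSTEM_TEMPLATE.format(article=article, role=role, description=description)


prompt = build_system_prompt(
    "a",
    "climate skeptic",
    "prioritizes economic growth over CO2 emission cuts, fossil fuels over "
    "renewable energy, and does not believe in climate science",
)
print(prompt)
```

Applied to the "csk" row, this reproduces the example instruction shown above.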

5.2 Analysis

We calculated one DRI value per model/survey/role by treating each LLM response as one participant in a deliberation. The role “all” indicates that all roles were part of that deliberation (n = 60 participants, which equals 5 participants for each of the 12 roles). DRI plots are shown in Figure 7.3.

6 Hypothesis Testing

6.1 H1a: random data

6.2 H1b: one-sample Wilcoxon signed rank test

model survey obs_mean N mu p_value_two.sided sig_two.sided p_value_greater sig_greater
Claude 3.5 Sonnet ccps 0.3759073 12 0 0.0009766 * 0.0004883 *
Claude 3.5 Sonnet energy_futures 0.4695921 12 0 0.0009766 * 0.0004883 *
Claude 3.5 Sonnet zh_winterthur 0.5683774 12 0 0.0004883 * 0.0002441 *
Claude 3.7 Sonnet ccps 0.6819898 12 0 0.0004883 * 0.0002441 *
Claude 3.7 Sonnet energy_futures 0.6173198 12 0 0.0004883 * 0.0002441 *
Claude 3.7 Sonnet zh_winterthur 0.5911667 12 0 0.0004883 * 0.0002441 *
Grok 3 Beta ccps 0.3605863 12 0 0.0004883 * 0.0002441 *
Grok 3 Beta energy_futures 0.7103851 12 0 0.0004883 * 0.0002441 *
Grok 3 Beta zh_winterthur 0.7314191 12 0 0.0004883 * 0.0002441 *
Gemini 2.5 Flash ccps 0.8336696 12 0 0.0004883 * 0.0002441 *
Gemini 2.5 Flash energy_futures 0.5166190 12 0 0.0009766 * 0.0004883 *
Gemini 2.5 Flash zh_winterthur 0.6778375 12 0 0.0004883 * 0.0002441 *
GPT-4o Mini ccps 0.0427425 12 0 0.6772461 n.s. 0.3386230 n.s.
GPT-4o Mini energy_futures -0.0899976 12 0 0.5693359 n.s. 0.7407227 n.s.
GPT-4o Mini zh_winterthur -0.2190937 12 0 0.0771484 n.s. 0.9680176 n.s.
GPT-3.5 Turbo ccps -0.2532340 12 0 0.0161133 * 0.9938965 n.s.
GPT-3.5 Turbo energy_futures -0.2836284 12 0 0.0122070 * 0.9953613 n.s.
GPT-3.5 Turbo zh_winterthur -0.4205772 12 0 0.0034180 * 0.9987793 n.s.
Command R ccps -0.4709172 12 0 0.0004883 * 1.0000000 n.s.
Command R energy_futures -0.0245292 12 0 0.7910156 n.s. 0.6333008 n.s.
Command R zh_winterthur -0.9582444 12 0 0.0004883 * 1.0000000 n.s.
Claude 3 Haiku ccps -0.3105968 12 0 0.0004883 * 1.0000000 n.s.
Claude 3 Haiku energy_futures -0.3584220 12 0 0.0009766 * 0.9997559 n.s.
Claude 3 Haiku zh_winterthur -0.6380549 12 0 0.0004883 * 1.0000000 n.s.
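Several p-values in this table repeat (0.0004883, 0.0009766). With N = 12 the exact signed-rank statistic takes only a limited set of values, so these are floor values rather than coincidences. The sketch below illustrates this in Python; it assumes an exact test with no ties and no zero differences, and the function names are hypothetical, not the code behind the table.

```python
def signed_rank_null_counts(n):
    """counts[w] = number of sign assignments (out of 2**n) whose
    sum of positive ranks W+ equals w, assuming no ties or zeros."""
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downwards so each rank is used at most once.
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return counts


def exact_p_values(w_plus, n):
    """Return (two-sided p, one-sided 'greater' p) for an observed W+."""
    counts = signed_rank_null_counts(n)
    total = 2 ** n
    p_greater = sum(counts[w_plus:]) / total
    p_less = sum(counts[: w_plus + 1]) / total
    p_two_sided = min(1.0, 2 * min(p_greater, p_less))
    return p_two_sided, p_greater


n = 12
w_max = n * (n + 1) // 2  # 78: all 12 differences are positive
for w_plus in (w_max, w_max - 1, 0):
    p_two, p_greater = exact_p_values(w_plus, n)
    print(w_plus, round(p_two, 7), round(p_greater, 7))
```

Under these assumptions, 0.0004883 is 2/4096, the smallest two-sided p-value attainable with 12 observations: it occurs when all 12 role-level DRI values lie on the same side of mu = 0.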

6.3 H2

## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | role) + (1 | survey)
##    Data: df
## 
## REML criterion at convergence: 127.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.86198 -0.63430  0.03286  0.59691  3.03838 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  role     (Intercept) 0.002483 0.04983 
##  survey   (Intercept) 0.005538 0.07442 
##  Residual             0.080233 0.28326 
## Number of obs: 288, groups:  role, 12; survey, 3
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)            -0.43569    0.06544  -6.658
## modelClaude 3.5 Sonnet  0.90698    0.06676  13.585
## modelClaude 3.7 Sonnet  1.06585    0.06676  15.964
## modelCommand R         -0.04887    0.06676  -0.732
## modelGemini 2.5 Flash   1.11173    0.06676  16.652
## modelGPT-3.5 Turbo      0.11654    0.06676   1.746
## modelGPT-4o Mini        0.34691    0.06676   5.196
## modelGrok 3 Beta        1.03649    0.06676  15.525
## 
## Correlation of Fixed Effects:
##             (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.510                                          
## mdlCld3.7Sn -0.510  0.500                                   
## modelCmmndR -0.510  0.500  0.500                            
## mdlGmn2.5Fl -0.510  0.500  0.500  0.500                     
## mdlGPT-3.5T -0.510  0.500  0.500  0.500  0.500              
## modlGPT-4Mn -0.510  0.500  0.500  0.500  0.500  0.500       
## modelGrk3Bt -0.510  0.500  0.500  0.500  0.500  0.500  0.500
## Linear mixed model fit by REML ['lmerMod']
## Formula: dri ~ model + (1 | survey/role)
##    Data: df
## 
## REML criterion at convergence: 128.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.95607 -0.66969 -0.00041  0.65619  3.06045 
## 
## Random effects:
##  Groups      Name        Variance  Std.Dev.
##  role:survey (Intercept) 0.0009013 0.03002 
##  survey      (Intercept) 0.0054477 0.07381 
##  Residual                0.0817355 0.28589 
## Number of obs: 288, groups:  role:survey, 36; survey, 3
## 
## Fixed effects:
##                        Estimate Std. Error t value
## (Intercept)            -0.43569    0.06412  -6.795
## modelClaude 3.5 Sonnet  0.90698    0.06739  13.460
## modelClaude 3.7 Sonnet  1.06585    0.06739  15.817
## modelCommand R         -0.04887    0.06739  -0.725
## modelGemini 2.5 Flash   1.11173    0.06739  16.498
## modelGPT-3.5 Turbo      0.11654    0.06739   1.730
## modelGPT-4o Mini        0.34691    0.06739   5.148
## modelGrok 3 Beta        1.03649    0.06739  15.381
## 
## Correlation of Fixed Effects:
##             (Intr) mC3.5S mC3.7S mdlCmR mG2.5F mGPT-T mGPT-M
## mdlCld3.5Sn -0.525                                          
## mdlCld3.7Sn -0.525  0.500                                   
## modelCmmndR -0.525  0.500  0.500                            
## mdlGmn2.5Fl -0.525  0.500  0.500  0.500                     
## mdlGPT-3.5T -0.525  0.500  0.500  0.500  0.500              
## modlGPT-4Mn -0.525  0.500  0.500  0.500  0.500  0.500       
## modelGrk3Bt -0.525  0.500  0.500  0.500  0.500  0.500  0.500
## boundary (singular) fit: see help('isSingular')
## refitting model(s) with ML (instead of REML)
## Data: df
## Models:
## m0: dri ~ 1 + (1 | survey/role)
## m1: dri ~ model + (1 | survey/role)
##    npar    AIC    BIC   logLik -2*log(L)  Chisq Df Pr(>Chisq)    
## m0    4 490.84 505.49 -241.420    482.84                         
## m1   11 118.59 158.89  -48.297     96.59 386.25  7  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
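As a quick arithmetic check on the model comparison above, the following Python sketch recomputes the chi-square statistic, its degrees of freedom, and the information criteria from the reported parameter counts and -2*log(L) values. The dictionary layout and function names are illustrative assumptions; the numbers are copied from the output.

```python
import math

N_OBS = 288  # "Number of obs" reported by lmer

# Values copied from the anova table above.
M0 = {"npar": 4, "deviance": 482.84}   # dri ~ 1 + (1 | survey/role)
M1 = {"npar": 11, "deviance": 96.59}   # dri ~ model + (1 | survey/role)


def aic(model):
    """AIC = -2*log(L) + 2 * number of parameters."""
    return model["deviance"] + 2 * model["npar"]


def bic(model, n_obs):
    """BIC = -2*log(L) + log(n) * number of parameters."""
    return model["deviance"] + math.log(n_obs) * model["npar"]


def likelihood_ratio(null, alternative):
    """Return (chi-square statistic, degrees of freedom)."""
    chisq = null["deviance"] - alternative["deviance"]
    df = alternative["npar"] - null["npar"]
    return chisq, df


chisq, df = likelihood_ratio(M0, M1)
print(round(chisq, 2), df)                   # matches Chisq and Df
print(round(aic(M0), 2), round(aic(M1), 2))  # matches the AIC column
print(round(bic(M0, N_OBS), 2), round(bic(M1, N_OBS), 2))
```

The 7 degrees of freedom correspond to the 7 model contrasts added in m1.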
##  model              emmean     SE   df lower.CL upper.CL
##  Gemini 2.5 Flash   0.6760 0.0641 7.44    0.526   0.8259
##  Claude 3.7 Sonnet  0.6302 0.0641 7.44    0.480   0.7800
##  Grok 3 Beta        0.6008 0.0641 7.44    0.451   0.7506
##  Claude 3.5 Sonnet  0.4713 0.0641 7.44    0.321   0.6211
##  GPT-4o Mini       -0.0888 0.0641 7.44   -0.239   0.0611
##  GPT-3.5 Turbo     -0.3191 0.0641 7.44   -0.469  -0.1693
##  Claude 3 Haiku    -0.4357 0.0641 7.44   -0.586  -0.2859
##  Command R         -0.4846 0.0641 7.44   -0.634  -0.3347
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95
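The estimated marginal means above can be reproduced from the fixed effects reported earlier. The Python sketch below assumes treatment coding with Claude 3 Haiku as the reference level (it is the only model without its own coefficient) and a balanced design, so each marginal mean is simply the intercept plus the model's coefficient. This is an illustration, not the emmeans computation itself.

```python
# Fixed effects copied from the lmer output above.
INTERCEPT = -0.43569  # assumed to be the reference level, Claude 3 Haiku
COEFFICIENTS = {
    "Claude 3 Haiku": 0.0,  # reference level has no separate coefficient
    "Claude 3.5 Sonnet": 0.90698,
    "Claude 3.7 Sonnet": 1.06585,
    "Command R": -0.04887,
    "Gemini 2.5 Flash": 1.11173,
    "GPT-3.5 Turbo": 0.11654,
    "GPT-4o Mini": 0.34691,
    "Grok 3 Beta": 1.03649,
}


def marginal_means(intercept, coefficients):
    """Intercept plus contrast for each model, sorted from highest to lowest."""
    means = {name: intercept + beta for name, beta in coefficients.items()}
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


means = marginal_means(INTERCEPT, COEFFICIENTS)
for name, value in means.items():
    print(f"{name:<18} {value: .4f}")
```

The resulting ordering separates the four models with positive means from the four with negative means, in line with the top/bottom grouping of Table 1.1 (with Gemini 2.5 Flash joining the top group).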

## # A tibble: 12 × 3
##    role  mean_dri sd_dri
##    <chr>    <dbl>  <dbl>
##  1 coa     0.125   0.547
##  2 csk     0.287   0.550
##  3 ctr     0.189   0.457
##  4 dis     0.0416  0.564
##  5 eco     0.141   0.638
##  6 eld     0.149   0.531
##  7 far     0.0617  0.612
##  8 fis     0.0519  0.604
##  9 lan     0.170   0.506
## 10 par     0.111   0.608
## 11 sub     0.210   0.541
## 12 vil     0.0379  0.616
## # A tibble: 12 × 4
##    role  mean_role_noise max_role_noise min_role_noise
##    <chr>           <dbl>          <dbl>          <dbl>
##  1 coa             0.246          0.549        0.116  
##  2 csk             0.187          0.370        0.00776
##  3 ctr             0.299          0.402        0.106  
##  4 dis             0.217          0.369        0.00799
##  5 eco             0.233          0.517        0.0277 
##  6 eld             0.245          0.724        0.0452 
##  7 far             0.221          0.373        0.0647 
##  8 fis             0.192          0.566        0.0365 
##  9 lan             0.251          0.442        0.121  
## 10 par             0.304          0.559        0.0512 
## 11 sub             0.349          0.685        0.128  
## 12 vil             0.301          0.571        0.0186
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  sd_rep by role
## Fligner-Killeen:med chi-squared = 8.0891, df = 11, p-value = 0.7053
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  7  1.8873 0.08108 .
##       88                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.3.1 Noise

6.4 Example of model inconsistencies

Table 6.1: ccps survey questions
Question Statement Response Type
C1 There is not enough information to definitively say that climate change is real. Likert from 1 to 11
C2 The response to climate change is not going to be positive. The same mistakes will keep happening. Likert from 1 to 11
C3 Climate variation is normal, so why should this be a problem? Likert from 1 to 11
C4 More educational programmes are needed to increase public awareness about climate change. Likert from 1 to 11
C5 Climate change will not be a problem because there will be technological solutions available. Likert from 1 to 11
C6 I don’t trust what scientists say about climate change. Likert from 1 to 11
C7 I don’t trust what I hear about climate change from government. Likert from 1 to 11
C8 We need strong political leadership to do something about climate change. Likert from 1 to 11
C9 I think it is safe to say climate change is here. Likert from 1 to 11
C10 I’m not going to do anything to address climate change because it is not a major issue. Likert from 1 to 11
C11 There’s not much point in me doing anything to fix this. No-one else is going to. Likert from 1 to 11
C12 It’s difficult to trust what comes out in the media on the issue of climate change. Likert from 1 to 11
C13 It is already too late to do anything, as any action to stop climate change will take a long time to take effect. Likert from 1 to 11
C14 I’m not concerned enough to do anything drastic about this, such as participate in political action. Likert from 1 to 11
C15 It is unfair that we are going to leave the climate in a mess for future generations. Likert from 1 to 11
C16 We should pay for greenhouse emissions. Likert from 1 to 11
C17 We can adapt to the coming changes. Likert from 1 to 11
C18 It is clear that we are already entering the zone of dangerous climate change. Likert from 1 to 11
C19 I care about the planet. Likert from 1 to 11
C20 I don’t know what to do. I’m very concerned and would like to do something, but I don’t have a realistic shortlist of things that would really make a difference. Likert from 1 to 11
C21 Australia does not owe it to the rest of the world to reduce emissions and suffer economically. Likert from 1 to 11
C22 If Australia reduces greenhouse gases it won’t make a difference. That will just shift Australian jobs to other countries. Likert from 1 to 11
C23 This is so depressing and is so out of our control. Likert from 1 to 11
C24 I believe that the difference we can have as an individual, in Australia, is so minimal that our actions are worthless. Likert from 1 to 11
C25 Australia is particularly vulnerable to climate change, and it is in our interest to help find an effective global solution. Likert from 1 to 11
C26 We need laws addressing climate change because people are not going to volunteer to change. Likert from 1 to 11
C27 I want to do something, but it is too big and too hard. Likert from 1 to 11
C28 When I read in the paper that climate change is not true, I start to have doubts about whether it is changing. Likert from 1 to 11
C29 Doing something to reduce emissions feels a bit hopeless but I just want to feel that I’m doing the most I can. Likert from 1 to 11
C30 The fate of the planet is too important to be left to market forces. Likert from 1 to 11
C31 Australia’s emissions are tiny, so it’s not up to us to act. Likert from 1 to 11
C32 Governments should take a far greater role in preparing towns and cities to adapt to the impacts of climate change. Likert from 1 to 11
C33 Failure to address climate change is the fault of political leaders. Likert from 1 to 11
P1 Leave the policy settings as they are. Ranked-choice from 1 to 7
P2 Policies that emphasise economic growth over climate change adaptation or mitigation. Ranked-choice from 1 to 7
P3 Policies that involve a dramatic cut back in CO2 emissions (by 50% in the next 10 years). Ranked-choice from 1 to 7
P4 Policies that involve a moderate cut in CO2 emissions (by 25% in over the next 10 years). Ranked-choice from 1 to 7
P5 Adaptation policies and expenditure (e.g. coastal protection, water desalinisation, improving infrastructure etc). Planning controls and emergency response programs. Ranked-choice from 1 to 7
P6 Adaptation policies that target individual, small business or community-based actions (eg support the installation of alternative energy generators, insulation, water use efficiency) Ranked-choice from 1 to 7
P7 Preparing for climate risk through the development of new approaches and technologies that enhance resilience to the impacts of climate variability or change. Ranked-choice from 1 to 7

7 Findings

7.1 Consistency

We compared top and bottom models in terms of consistency of DRI and Cronbach’s alpha (see top models in Figure 7.1 and bottom models in Figure 7.2).

7.1.1 Top models

Figure 7.1: Top models

We found that top LLMs are consistent across roles in terms of both DRI and Cronbach’s alpha (policies). The high DRI across roles (median = 0.637; IQR = 0.161) suggests that top LLMs tend to consistently align their considerations and policy preferences. The high Cronbach’s alpha for their policy preferences (median = 0.784; IQR = 0.047) suggests that they tend to agree on the ranking of their policy preferences.

7.1.2 Bottom models

Figure 7.2: Bottom models

We also found that bottom LLMs are not consistent across roles in terms of DRI and are less consistent than top models in terms of Cronbach’s alpha (policies). The low DRI across roles (median = -0.177; IQR = 0.163) suggests that bottom LLMs tend to consistently misalign their considerations and policy preferences. Their lower Cronbach’s alpha for policy preferences (median = 0.635; IQR = 0.11) suggests that they agree less on the ranking of their policy preferences than top models do.
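For readers unfamiliar with Cronbach’s alpha, the following Python sketch shows the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). It is a generic illustration only: which dimension of the response matrix is treated as "items" in this report is not stated here, so the row/column layout below is an assumption, and the function name and data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one per respondent, each holding k item scores.
    (Assumption: rows are respondents and columns are items.)
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 4 respondents x 3 items.
example = [
    [1, 2, 1],
    [3, 3, 4],
    [5, 6, 5],
    [7, 6, 7],
]
print(round(cronbach_alpha(example), 3))
```

Values close to 1 indicate that the items move together across respondents; items that vary independently push alpha towards 0.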

7.1.3 Summary for each model

7.1.3.1 DRI

Table 7.1: Mean DRI across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.512 0.639 -0.291 -0.281 0.638 -0.213 0.000 0.625 claude-3-7-sonnet-20250219
2 coa 0.350 0.565 -0.526 -0.435 0.810 -0.315 -0.019 0.567 gemini-2.5-flash
3 csk 0.543 0.773 -0.118 -0.580 0.875 0.163 -0.153 0.795 gemini-2.5-flash
4 ctr 0.343 0.567 -0.368 -0.264 0.663 -0.129 0.252 0.447 gemini-2.5-flash
5 dis 0.476 0.538 -0.553 -0.490 0.569 -0.719 0.057 0.455 gemini-2.5-flash
6 eco 0.364 0.720 -0.281 -0.831 0.854 -0.472 0.084 0.696 gemini-2.5-flash
7 eld 0.404 0.498 -0.335 -0.396 0.796 -0.078 -0.322 0.626 gemini-2.5-flash
8 far 0.479 0.651 -0.524 -0.673 0.821 -0.388 -0.370 0.497 gemini-2.5-flash
9 fis 0.497 0.593 -0.492 -0.560 0.685 -0.665 -0.244 0.602 gemini-2.5-flash
10 lan 0.595 0.633 -0.318 -0.347 0.477 -0.466 0.199 0.587 claude-3-7-sonnet-20250219
11 par 0.498 0.708 -0.669 -0.472 0.598 -0.164 -0.284 0.670 claude-3-7-sonnet-20250219
12 sub 0.526 0.712 -0.433 -0.218 0.556 -0.106 -0.014 0.654 claude-3-7-sonnet-20250219
13 vil 0.581 0.604 -0.612 -0.550 0.407 -0.490 -0.252 0.613 grok-3-beta

7.1.3.2 Cronbach’s Alpha (Policies)

Table 7.2: Mean alpha (policies) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.725 0.792 0.614 0.638 0.801 0.599 0.641 0.818 grok-3-beta
2 coa 0.713 0.745 0.816 0.808 0.771 0.737 0.763 0.807 claude-3-haiku-20240307
3 csk 0.783 0.802 0.813 0.708 0.848 0.764 0.715 0.851 grok-3-beta
4 ctr 0.749 0.791 0.774 0.776 0.918 0.787 0.727 0.755 gemini-2.5-flash
5 dis 0.761 0.772 0.669 0.802 0.771 0.762 0.756 0.796 command-r-08-2024
6 eco 0.764 0.844 0.711 0.730 0.814 0.800 0.759 0.716 claude-3-7-sonnet-20250219
7 eld 0.722 0.793 0.788 0.740 0.741 0.801 0.813 0.828 grok-3-beta
8 far 0.726 0.807 0.791 0.843 0.827 0.769 0.828 0.824 command-r-08-2024
9 fis 0.787 0.792 0.690 0.793 0.829 0.750 0.825 0.704 gemini-2.5-flash
10 lan 0.715 0.792 0.802 0.805 0.789 0.783 0.795 0.792 command-r-08-2024
11 par 0.785 0.704 0.774 0.777 0.790 0.778 0.762 0.833 grok-3-beta
12 sub 0.841 0.800 0.671 0.754 0.761 0.760 0.803 0.839 claude-3-5-sonnet-20241022
13 vil 0.708 0.818 0.770 0.794 0.808 0.786 0.798 0.662 claude-3-7-sonnet-20250219

7.1.3.3 Cronbach’s Alpha (Considerations)

Table 7.3: Mean alpha (considerations) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.990 0.990 0.976 0.975 0.984 0.911 0.976 0.987 claude-3-5-sonnet-20241022
2 coa 0.863 0.918 0.880 0.787 0.849 0.886 0.837 0.891 claude-3-7-sonnet-20250219
3 csk 0.769 0.856 0.898 0.767 0.551 0.952 0.817 0.831 gpt-3.5-turbo
4 ctr 0.916 0.909 0.872 0.915 0.852 0.916 0.852 0.906 claude-3-5-sonnet-20241022
5 dis 0.905 0.921 0.894 0.904 0.859 0.918 0.876 0.896 claude-3-7-sonnet-20250219
6 eco 0.900 0.860 0.884 0.827 0.842 0.865 0.871 0.863 claude-3-5-sonnet-20241022
7 eld 0.917 0.899 0.919 0.886 0.917 0.911 0.879 0.903 claude-3-haiku-20240307
8 far 0.905 0.848 0.919 0.747 0.815 0.774 0.860 0.905 claude-3-haiku-20240307
9 fis 0.916 0.895 0.894 0.907 0.896 0.918 0.891 0.905 gpt-3.5-turbo
10 lan 0.917 0.914 0.884 0.904 0.884 0.885 0.909 0.917 claude-3-5-sonnet-20241022
11 par 0.925 0.905 0.863 0.867 0.830 0.888 0.885 0.922 claude-3-5-sonnet-20241022
12 sub 0.902 0.919 0.895 0.758 0.851 0.889 0.906 0.911 claude-3-7-sonnet-20250219
13 vil 0.881 0.880 0.914 0.901 0.873 0.927 0.895 0.887 gpt-3.5-turbo

7.2 Model/Survey DRI Plots

These plots show a simulated deliberation across all 12 roles for each survey and model. Each simulated deliberation has 60 participants (12 roles with 5 participants each).

Note that bottom models are visually inconsistent.

Figure 7.3: DRI Plots

7.3 Survey/Role DRI Plots

These plots show a simulated deliberation across all models in the same class (i.e., top, bottom) for each role and survey. Each simulated deliberation has 20 participants (4 models with 5 participants each).

Note that top models are visually more consistent than bottom models.

7.3.1 Top models

7.3.2 Bottom models

7.4 Permutation tests

NOTE: This section is skipped by default. Remove eval = FALSE from the R code chunks to run them.

We conducted permutation tests with 10,000 iterations to check which models and which roles are consistent beyond what chance would produce.

7.4.1 Models and Surveys: Which models are truly consistent across roles?

In this permutation test, we explore the likelihood that the consistency, measured by DRI, is due to chance across surveys and roles.

Most models seem to be consistent across roles. Few of the 10,000 permutations led to a higher DRI than the observed DRI, suggesting that the observed value is likely not due to chance.
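The logic of such a test can be sketched as follows. This Python example is a generic illustration, not the procedure used in this report: the text does not state what is shuffled, so the sketch assumes paired vectors where one vector is permuted, and it uses Pearson correlation as a stand-in statistic rather than DRI. All names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation, used here only as a stand-in statistic."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, statistic, n_permutations=10_000, seed=0):
    """Share of permutations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if statistic(x, shuffled) >= observed:
            at_least_as_high += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_high + 1) / (n_permutations + 1)


# Hypothetical paired data with a strong positive relationship.
x = list(range(1, 13))
y = [v + (0.5 if v % 2 else -0.5) for v in x]
observed, p_value = permutation_p_value(x, y, pearson, n_permutations=2000)
print(round(observed, 3), p_value)
```

A small p-value means that few shuffled data sets reach the observed statistic, which is the pattern described above for most models.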

7.4.2 Surveys and Roles: Are models truly consistent across roles?

In this permutation test, we explore the likelihood that the consistency, measured by DRI, is due to chance across surveys and roles.


  1. Note that gemini-2.5-pro-preview-03-25 was replaced by gemini-2.5-pro. However, this version of the model became significantly slower and more expensive because it has “thinking” enabled by default, which cannot be turned off. As a result, we decided to use the flash version (gemini-2.5-flash), a lighter and cheaper alternative.